
feat(evaluation): offline evaluation module with uv run evaluate CLI #280

Open
ShuxinLin wants to merge 1 commit into main from feat/evaluation-module

Conversation

@ShuxinLin (Collaborator)

Summary

  • New src/evaluation/ module: load saved agent trajectories + scenarios → grade → emit JSON report
  • Three graders: exact_string_match and numeric_match (deterministic), plus a pluggable LLM judge with a six-criterion rubric (registry pattern sketched below)
  • Per-task ops metrics (turns, tool calls, tokens, duration) + aggregate rollup (totals, p50/p95, optional cost estimate)
  • uv run evaluate CLI registered in pyproject.toml
  • 39 new unit tests; full repo suite stays green (309 passed)

Layout follows the three-stage run → evaluate → report pattern used by SWE-bench, HELM, and τ-bench. Re-grading from saved trajectories is first-class — no need to re-run the agent.
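
For reviewers who want a concrete picture of the grader design, here is a minimal sketch of a name-keyed grader registry holding the two deterministic graders. The grader names come from this PR; the decorator, signatures, and normalization details are illustrative assumptions, not the module's actual API.

```python
# Minimal sketch, not the real src/evaluation/ API: a name-keyed grader registry.
from typing import Callable

GRADERS: dict[str, Callable[..., bool]] = {}

def register_grader(name: str):
    """Register a grading function under a string key (assumed registry pattern)."""
    def wrap(fn: Callable[..., bool]) -> Callable[..., bool]:
        GRADERS[name] = fn
        return fn
    return wrap

@register_grader("exact_string_match")
def exact_string_match(expected: str, actual: str) -> bool:
    # Deterministic: whitespace-normalized string equality.
    return expected.strip() == actual.strip()

@register_grader("numeric_match")
def numeric_match(expected: str, actual: str, tol: float = 1e-6) -> bool:
    # Deterministic: parse both sides as floats and compare within a tolerance.
    try:
        return abs(float(expected) - float(actual)) <= tol
    except ValueError:
        return False
```

The LLM judge would presumably sit in the same registry as a third entry, bound to LLMBackend and invoked only when a scenario (or the CLI default) selects it.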

Closes #279

Test plan

  • uv run pytest src/evaluation/ -v — 39 passed
  • uv run pytest src/ -v -k "not integration" — 309 passed (no regressions)
  • CLI smoke test: uv run evaluate --trajectories <dir> --scenarios <file> --output report.json --grader-default exact_string_match produced the expected report
  • Grader override path verified: a per-scenario grading_method overrides --grader-default (see the scenarios sketch after this list)
  • Trajectory metric extraction tested for both SDK Trajectory dict and plan-execute list[StepResult] shapes
  • Reviewer: try --judge-model litellm_proxy/anthropic/claude-opus-4-5 against a real LiteLLM proxy on a small batch
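
To make the override path concrete, here is a hedged sketch of a scenarios JSONL file where one entry sets its own grading_method. Only scenario_id, grading_method, and the grader names appear in this PR; any other field name (for example expected_output) is an assumption, not the module's real Scenario schema.

```python
# Hypothetical scenarios.jsonl writer; field names beyond scenario_id and
# grading_method are assumptions, not the module's actual Scenario schema.
import json

scenarios = [
    # No grading_method: graded with the CLI-level --grader-default.
    {"scenario_id": "s-001", "expected_output": "42"},
    # Per-scenario override: graded with numeric_match regardless of the default.
    {"scenario_id": "s-002", "expected_output": "3.14159", "grading_method": "numeric_match"},
]

with open("scenarios.jsonl", "w", encoding="utf-8") as fh:
    for scenario in scenarios:
        fh.write(json.dumps(scenario) + "\n")
```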

Implement src/evaluation/ — consumes saved agent trajectories
({run_id}.json under AGENT_TRAJECTORY_DIR) and scenario files, joins
them on scenario_id, runs a registered grader per scenario, and emits
a JSON report combining grading results with operational metrics
(tokens, duration p50/p95, tool calls, optional cost estimate).
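
As a rough mental model of that flow, the sketch below joins saved trajectories to scenarios on scenario_id and grades each pair. It assumes simplified dict shapes and a caller-supplied grading function; the names and structures are illustrative, not the module's real loaders or Pydantic models.

```python
# Simplified sketch of the join-and-grade flow; keys and shapes are assumed.
import json
from pathlib import Path
from typing import Callable

def evaluate(trajectory_dir: Path, scenarios: list[dict], grade: Callable[[dict, dict], bool]) -> dict:
    """Join saved {run_id}.json trajectories to scenarios on scenario_id and grade each pair."""
    # Index saved trajectories by the scenario they were produced for (assumed key).
    by_scenario: dict[str, dict] = {}
    for path in trajectory_dir.glob("*.json"):
        trajectory = json.loads(path.read_text())
        by_scenario[trajectory["scenario_id"]] = trajectory

    results = []
    for scenario in scenarios:
        trajectory = by_scenario.get(scenario["scenario_id"])
        if trajectory is None:
            results.append({"scenario_id": scenario["scenario_id"], "status": "missing_trajectory"})
            continue
        # The real runner picks the grader per scenario (grading_method or --grader-default).
        results.append({"scenario_id": scenario["scenario_id"], "passed": grade(scenario, trajectory)})
    return {"results": results}
```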

The shape follows SWE-bench / HELM / τ-bench conventions: agent run
→ evaluate → report.json, with offline re-grading from saved
trajectories as a first-class workflow.

Includes:
- Pydantic models (Scenario, PersistedTrajectory, GradeResult,
  OpsMetrics, EvalReport)
- Loader for trajectory dirs and JSON/JSONL scenario files
- Grader registry with two deterministic graders
  (exact_string_match, numeric_match) and a pluggable LLM judge
  bound to LLMBackend (six-criterion rubric)
- Per-task ops metric extraction (handles both SDK Trajectory and
  plan-execute list[StepResult] shapes) plus aggregate rollups
  (rollup sketched after this list)
- Report writer with terminal summary and JSON output
- evaluate script registered in [project.scripts]
- 39 unit tests covering models, loader, graders, metrics, report,
  and end-to-end runner — all passing alongside existing 270 tests
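
For a sense of what the aggregate rollup involves, here is a small sketch that pulls a per-task duration out of either persisted trajectory shape and computes totals and p50/p95. The shape probing and key names are assumptions for illustration; the actual extraction logic in src/evaluation/ may differ.

```python
# Illustrative rollup only; key names inside persisted trajectories are assumed.
import statistics
from typing import Any

def task_duration_s(trajectory: Any) -> float:
    """Extract a task's duration from either supported shape (assumed keys)."""
    if isinstance(trajectory, dict):
        # SDK Trajectory persisted as a single dict.
        return float(trajectory.get("duration_s", 0.0))
    # Plan-execute runs persisted as a list of step results.
    return sum(float(step.get("duration_s", 0.0)) for step in trajectory)

def rollup(durations_s: list[float]) -> dict:
    """Aggregate per-task durations into totals and p50/p95."""
    if not durations_s:
        return {"count": 0}
    if len(durations_s) < 2:
        p50 = p95 = durations_s[0]
    else:
        # quantiles(n=20) returns 19 cut points: index 9 is p50, index 18 is p95.
        cuts = statistics.quantiles(durations_s, n=20, method="inclusive")
        p50, p95 = cuts[9], cuts[18]
    return {"count": len(durations_s), "total_s": sum(durations_s), "p50_s": p50, "p95_s": p95}
```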

Closes #279

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@DhavalRepo18 (Collaborator) commented Apr 28, 2026

https://mlflow.org/docs/latest/genai/concepts/scorers/ Please use these concepts and prefer to use Scorer:

  • An Evaluator has multiple Scorers:
    • LLM-As-Judge
    • Semantic-Score
    • Code-Based

